Maria Bochenek
In this assignment we will again use the QSAR-biodegradation dataset LINK for a binary classification task. The dataset has 17 features and 1055 rows. We changed the class labels from $\{1, 2\}$ to $\{0, 1\}$ for consistency with the procedure in HW1 (it was necessary for using BCELoss while training the Neural Network Classifier). Class 0 has $699$ instances and class 1 has $356$, so the dataset's imbalance ratio is $1.96$. The task is to distinguish ready (class 1) and not ready (class 0) biodegradable molecules based on molecular structure descriptors.
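As a quick sanity check, the imbalance ratio can be recomputed from the class counts stated above (the counts are taken from the report, not re-derived from the CSV):

```python
# Class counts as reported above; the imbalance ratio is majority / minority.
counts = {0: 699, 1: 356}
imbalance_ratio = counts[0] / counts[1]
print(round(imbalance_ratio, 2))  # → 1.96
```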
As our main tree-based ensemble model we selected sklearn.ensemble.GradientBoostingClassifier, which was one of the better-performing models in the analysis from HW1. To extend the scope of our analysis we included another tree-based ensemble model, sklearn.ensemble.RandomForestClassifier, and two models with worse performance (based on HW1) for comparison: sklearn.linear_model.LogisticRegression with the LBFGS solver and sklearn.svm.SVC with a radial basis function (RBF) kernel.
We set a random seed for reproducibility and split the dataset into training and validation sets in an 80:20 ratio. We performed feature scaling using sklearn.preprocessing.StandardScaler. Model performance is presented in the table below.
| | recall | precision | f1 | accuracy | auc |
|---|---|---|---|---|---|
| GradientBoostingClassifier | 0.898305 | 0.688312 | 0.779412 | 0.857820 | 0.925680 |
| RandomForestClassifier | 0.830508 | 0.720588 | 0.771654 | 0.862559 | 0.931813 |
| LogisticRegression | 0.762712 | 0.714286 | 0.737705 | 0.848341 | 0.899309 |
| SVC | 0.796610 | 0.671429 | 0.728682 | 0.834123 | 0.822647 |
We calculated Permutation-based Variable Importance (PVI) for the sklearn.ensemble.GradientBoostingClassifier model. The results are presented in the figure below.
Permutation-based Variable Importance values show high importance of variable $V36$, previously identified in HW4 as the most important variable for observation $1002$ from class 0. The second and third most important variables according to PVI are $V18$ and $V22$ respectively. These variables were the first and second in order of importance according to the LIME decomposition (HW3) for observations $1002$, $626$ (class 0), and $253$ (class 1), while sixth and first for observation $279$ (class 1). Moreover, while the PDP plots for $V36$ did not suggest its high importance, the ones for $V18$ and $V22$ did (HW5). Additionally, we can spot that $V39$ and $V1$ round out the top 5 variables with the highest PVI, and both have a similar impact on the drop-out loss of approximately $0.028$. The next (sixth) variable, $V27$, has an impact smaller by $0.005$ than $V39$ and $V1$, while the one after it (seventh) has an impact smaller by as much as $0.016$. So the biggest gaps in impact level are between the first and second place and between the sixth and seventh.
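For reference, the "drop-out loss" that PVI reports can be sketched as follows. This is a simplified single-permutation version assuming $1 - AUC$ as the loss, not the exact dalex implementation:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def dropout_loss(model, X, y, column, rng):
    """Loss (1 - AUC) after shuffling one column of a DataFrame.

    The further this value rises above the baseline loss of the intact
    model, the more important the permuted variable is.
    """
    X_perm = X.copy()
    X_perm[column] = rng.permutation(X_perm[column].values)
    return 1 - roc_auc_score(y, model.predict_proba(X_perm)[:, 1])
```

dalex averages such losses over several permutations (its `B` parameter) to reduce the variance of the estimate.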
Furthermore, we also inspected variable importance relative to the baseline value. The top 10 rankings were identical in both cases.
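The identical rankings are expected: the ratio variant divides every drop-out loss by the same baseline (full-model) loss, which is a monotone transformation and therefore cannot change the ordering. A toy illustration with hypothetical numbers:

```python
import numpy as np

baseline = 0.08                                   # hypothetical full-model loss
dropout = np.array([0.131, 0.106, 0.111, 0.097])  # hypothetical losses after permutation
ratio = dropout / baseline                        # same positive scale factor for every variable

# descending orderings agree
print(np.argsort(-dropout))  # → [0 2 1 3]
print(np.argsort(-ratio))    # → [0 2 1 3]
```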
To compare variable importance between models we used variable importance relative to the baseline value (ratio). The top 5 most important variables for each model are presented in the figure below.
The first notable difference is that the top 5 variables' impact on drop-out loss change for the LogisticRegression model is significantly higher than the impact of the top 5 variables for other models, with only $V36$'s (1st) impact for GradientBoostingClassifier being higher than $V14$'s (5th) impact for LogisticRegression. This is expected as Logistic Regression models are comparatively simpler and often rely on a few highly important variables for predictions and are very sensitive to their change.
Both tree-based ensemble models had the same set of the top 5 most important variables and the same most important variable, $V36$; however, the order of the remaining ones differed. The impact levels of the top 5 variables on the drop-out loss were the lower of the two tree-based models for the RandomForestClassifier, with even the impact of its first variable being lower than the impact of the 5th variable for the GradientBoostingClassifier.
Overall, the top 5 variables of the Support Vector Machine Classifier (SVC) had the lowest impact on the drop-out loss, and the differences between these variables' impacts were also the smallest of all models. LogisticRegression and SVC both had variables $V13$, $V22$, $V1$, and $V18$ in their top 5; however, the most important variable for SVC was $V39$ (2nd and 4th for RandomForestClassifier and GradientBoostingClassifier respectively), and the 5th most important for LogisticRegression was $V14$, which was outside of the top 5 for all the other models.
Variables $V1$, $V18$, and $V22$ were present in the top 5 for all the models. These three were repeatedly identified as important in previous homeworks, both locally and globally.
For the GradientBoostingClassifier we compared Permutation-based Variable Importance with the traditional importance measure for tree-based models: Gini-based (impurity) importance. The results are presented in the plot below (sorted by PVI values).
We can clearly see the high importance of $V36$ in the predictions of the GradientBoostingClassifier according to both PVI and Gini-based importance. High Gini-based importance means that splits on $V36$ yield a large total decrease in node impurity, i.e., they separate the classes well.
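For context, the Gini impurity of a node with class proportions $p_k$ is $G = 1 - \sum_k p_k^2$, and a feature's Gini-based importance accumulates the impurity decreases of all splits made on that feature. A minimal sketch of the impurity itself:

```python
def gini(proportions):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p ** 2 for p in proportions)

print(gini([0.5, 0.5]))  # → 0.5 (maximally impure two-class node)
print(gini([1.0, 0.0]))  # → 0.0 (pure node)
```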
It is important to note that Gini-based importance and PVI are different measures, so comparing their absolute values does not make sense; however, we can confidently deduce the high importance of the $V36$, $V18$, $V22$, $V39$, and $V1$ features, since all of them are in the top 6 for both measures. Inspecting Gini-based importance also suggests that $V12$'s impact on predictions might be worth examining, which would not be obvious from PVI alone, since it is only 8th in the PVI order of importance.
It is vital to remember that impurity-based feature importances, such as Gini-based variable importance, can be misleading for high-cardinality features (many unique values). Permutation-based methods are then a more reliable choice, since they are not biased toward high-cardinality features and can be computed on a held-out test set.
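This bias is easy to demonstrate on synthetic data (an assumed toy setup, not our QSAR data): a pure-noise "id" column with as many unique values as rows picks up non-trivial impurity importance, while its permutation importance on held-out data stays near zero.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
n = 1000
signal = rng.normal(size=n)
noise_id = rng.permutation(n).astype(float)  # high-cardinality noise: n unique values
X = np.column_stack([signal, noise_id])
y = (signal + rng.normal(scale=1.0, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

perm = permutation_importance(rf, X_te, y_te, random_state=0)
# impurity importance of the noise id is inflated; its permutation importance is ~0
print(rf.feature_importances_[1], perm.importances_mean[1])
```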
We computed SHAP variable importance for the GradientBoostingClassifier. The results can be seen in the figure below.
SHAP values show which features are highly discriminatory in predicting the output. Generally, high values of the most important variable according to SHAP ($V22$) push the model toward assigning class 0 to an observation, while low values are rather responsible for predicting class 1. However, this kind of separation is not really clear for most of the top variables (we can spot a lot of purple as well as pink among the blues, etc.). Moreover, we can infer that high values of $V12$ (5th in importance) and low values of $V30$ are predictive of class 0.
When comparing SHAP variable importance with PVI, we see that the set of the top 4 most important features is the same for both measures; however, the top 3 appear in a different order. The 5th most important variable based on SHAP values is $V12$ (3rd most important based on Gini-based importance), while based on PVI it is $V1$ (2nd most important based on Gini-based importance). Therefore, to identify the same set of the most important variables we would need to use at least two different importance measures.
It is worth noting that while computing the SHAP values with dalex using TreeSHAP we had to disable the additivity check (check_additivity=False), because some of the values were not exactly additive. Thus, the presented results might not be fully reliable.
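The additivity (local accuracy) property that the disabled check verifies states that an observation's SHAP values sum up to the difference between its prediction and the expected prediction:

```latex
f(x) = \mathbb{E}[f(X)] + \sum_{i=1}^{M} \phi_i(x)
```

where $\phi_i(x)$ is the SHAP value of feature $i$ for observation $x$ and $M$ is the number of features; with check_additivity=False, small numerical violations of this equality pass silently.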
import dalex as dx
import lime
import shap
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.svm import SVC
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import random
import warnings
warnings.filterwarnings("ignore")
import plotly
plotly.offline.init_notebook_mode()
url = "https://raw.githubusercontent.com/adrianstando/imbalanced-benchmarking-set/main/datasets/qsar-biodeg.csv"
df = pd.read_csv(url,index_col=0)
# change target labels 2->1 and 1->0
df["TARGET"] -= 1
display(df.head())
X = df.loc[:, df.columns != 'TARGET']
y = df["TARGET"]
| V1 | V2 | V8 | V12 | V13 | V14 | V15 | V17 | V18 | V22 | V27 | V28 | V30 | V31 | V36 | V37 | V39 | TARGET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.919 | 2.6909 | 31.4 | 0.000 | 3.106 | 2.550 | 9.002 | 0.960 | 1.142 | 1.201 | 1.932 | 0.011 | 0.000 | 4.489 | 2.949 | 1.591 | 7.253 | 1 |
| 1 | 4.170 | 2.1144 | 30.8 | 0.000 | 2.461 | 1.393 | 8.723 | 0.989 | 1.144 | 1.104 | 2.214 | -0.204 | 0.000 | 1.542 | 3.315 | 1.967 | 7.257 | 1 |
| 2 | 3.932 | 3.2512 | 26.7 | 0.000 | 3.279 | 2.585 | 9.110 | 1.009 | 1.152 | 1.092 | 1.942 | -0.008 | 0.000 | 4.891 | 3.076 | 2.417 | 7.601 | 1 |
| 3 | 3.000 | 2.7098 | 20.0 | 0.000 | 2.100 | 0.918 | 6.594 | 1.108 | 1.167 | 1.024 | 1.414 | 1.073 | 8.361 | 1.333 | 3.046 | 5.000 | 6.690 | 1 |
| 4 | 4.236 | 3.3944 | 29.4 | -0.271 | 3.449 | 2.753 | 9.528 | 1.004 | 1.147 | 1.137 | 1.985 | -0.002 | 10.348 | 5.588 | 3.351 | 2.405 | 8.003 | 1 |
# Splitting data
SEED = 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)
scaler = StandardScaler().fit(X_train)
#scaling - transforming
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# convert back to dataframes
X_train = pd.DataFrame(X_train, columns = X.columns, index = y_train.index)
X_test = pd.DataFrame(X_test, columns = X.columns, index = y_test.index)
models = {
'GradientBoostingClassifier': GradientBoostingClassifier(random_state=SEED, n_estimators=100),
'RandomForestClassifier': RandomForestClassifier(random_state=SEED, n_estimators=300, max_depth=None),
'LogisticRegression': LogisticRegression(random_state=SEED, penalty='l2', solver = 'lbfgs'),
'SVC': SVC(random_state=SEED, kernel='rbf')
}
explainers = {}
for name, model in models.items():
model.fit(X_train, y_train)
explainer = dx.Explainer(model, X_test, y_test)
# save explainer
explainers[name] = explainer
Preparation of a new explainer is initiated
  -> data              : 211 rows 17 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 211 values
  -> model_class       : sklearn.ensemble._gb.GradientBoostingClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.00988, mean = 0.358, max = 0.983
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.926, mean = -0.0782, max = 0.934
  -> model_info        : package sklearn
A new explainer has been created!
(analogous output follows for RandomForestClassifier, LogisticRegression, and SVC)
performance = pd.concat([explainer.model_performance().result for explainer in explainers.values()], axis=0)
performance
| | recall | precision | f1 | accuracy | auc |
|---|---|---|---|---|---|
| GradientBoostingClassifier | 0.898305 | 0.688312 | 0.779412 | 0.857820 | 0.925680 |
| RandomForestClassifier | 0.830508 | 0.720588 | 0.771654 | 0.862559 | 0.931813 |
| LogisticRegression | 0.762712 | 0.714286 | 0.737705 | 0.848341 | 0.899309 |
| SVC | 0.796610 | 0.671429 | 0.728682 | 0.834123 | 0.822647 |
print(performance.to_markdown(index=True))
| | recall | precision | f1 | accuracy | auc |
|:---------------------------|---------:|------------:|---------:|-----------:|---------:|
| GradientBoostingClassifier | 0.898305 | 0.688312 | 0.779412 | 0.85782 | 0.92568 |
| RandomForestClassifier | 0.830508 | 0.720588 | 0.771654 | 0.862559 | 0.931813 |
| LogisticRegression | 0.762712 | 0.714286 | 0.737705 | 0.848341 | 0.899309 |
| SVC | 0.79661 | 0.671429 | 0.728682 | 0.834123 | 0.822647 |
explainer = explainers['GradientBoostingClassifier']
np.random.seed(SEED)
pvi = explainer.model_parts(label = "GradientBoostingClassifier", random_state=SEED)
pvi.plot(digits=4,
show=False,
title="Permutation-based Variable Importance").update_layout(autosize=False, width=800, height=500)
pvi_ratio = explainer.model_parts(
type="ratio",
label = 'GradientBoostingClassifier VI ratio',
random_state=SEED,
)
np.all(pvi_ratio.result["variable"] == pvi.result["variable"])
True
pvi_ratio.plot(digits=4,
show=False,
title="Permutation-based Variable Importance ratio").update_layout(autosize=False, width=800, height=500)
What are the differences? Why?
np.random.seed(SEED)
pvi_all = [explainers[model].model_parts(random_state=SEED) for model in explainers]
pvi_all[0].plot(
pvi_all[1:],
digits=4,
max_vars=5,
split='model',
title="Comparison of perm-based VI for different models",
show=False
).update_layout(width=900)
np.random.seed(SEED)
pvi_ratio_all = [explainers[model].model_parts(type="ratio", random_state=SEED) for model in explainers]
pvi_ratio_all[0].plot(
pvi_ratio_all[1:],
digits=4,
max_vars=5,
title="Comparison of perm-based VI for different models",
show=False
).update_layout(width=900)
pvi_ratio_all[0].plot(
pvi_ratio_all[1:],
digits=4,
# max_vars=10,
split='variable',
title="Comparison of perm-based VI for different models",
show=False
).update_layout(width=900)
model = models['GradientBoostingClassifier']
model.feature_importances_
array([0.09628143, 0.02011998, 0.03414656, 0.08401102, 0.02618981,
0.02338471, 0.00254566, 0.00967464, 0.07057429, 0.06590164,
0.03958381, 0.00985456, 0.05683647, 0.01686283, 0.32390064,
0.05325321, 0.06687874])
df = pd.DataFrame({'variable': X_test.columns, 'Gini-based': model.feature_importances_})
df = df.merge(pvi.result[['variable', 'dropout_loss']])
df = df.rename(columns={"dropout_loss": "Permutation-based"})
df
| | variable | Gini-based | Permutation-based |
|---|---|---|---|
| 0 | V1 | 0.096281 | 0.101990 |
| 1 | V2 | 0.020120 | 0.083413 |
| 2 | V8 | 0.034147 | 0.086363 |
| 3 | V12 | 0.084011 | 0.083575 |
| 4 | V13 | 0.026190 | 0.072207 |
| 5 | V14 | 0.023385 | 0.076918 |
| 6 | V15 | 0.002546 | 0.074599 |
| 7 | V17 | 0.009675 | 0.074459 |
| 8 | V18 | 0.070574 | 0.110967 |
| 9 | V22 | 0.065902 | 0.106016 |
| 10 | V27 | 0.039584 | 0.097324 |
| 11 | V28 | 0.009855 | 0.078825 |
| 12 | V30 | 0.056836 | 0.074331 |
| 13 | V31 | 0.016863 | 0.070060 |
| 14 | V36 | 0.323901 | 0.130977 |
| 15 | V37 | 0.053253 | 0.074153 |
| 16 | V39 | 0.066879 | 0.102765 |
# sort by decreasing Permutation-based values
axes = df.sort_values(by=['Permutation-based'], ascending=False).plot.bar(x='variable',
                                                                          rot=0,
                                                                          figsize=(10, 7),
                                                                          colormap='tab10',  # pass by name: `cm` was never imported
                                                                          fontsize=11)
plt.title("Gini-based vs Permutation-based Variable Importance", fontsize=15)
plt.xlabel("Variable", fontsize=13)
plt.ylabel("Importance", fontsize=13)
plt.legend(fontsize=13)
plt.grid(linestyle = '--', alpha=0.7)
plt.show()
What are the differences? Why?
explainer = explainers['GradientBoostingClassifier']
np.random.seed(SEED)
shap_vi = explainer.model_parts(type="shap_wrapper",
shap_explainer_type="TreeExplainer",
check_additivity=False)
shap_vi.plot()